-
-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable testing of new scheduler in CI #6428
base: master
Are you sure you want to change the base?
Conversation
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferencesYou may notice some variations in coverage metrics with the latest Coverage engine update. For more details, visit the documentation |
@Pansysk75 it looks like that the new scheduler doesn't properly work yet (as you suggested). the clang-12 and gcc-10 errors are most likely related. |
retest |
@hkaiser FYI I'm working on this one, I've reproduced some failing tests ( |
@hkaiser It seems like the moodycamel queue provides FIFO guarantees only among items placed by the same thread, however there is no guarantee that items queued by different threads will be dequeued in a FIFO order. From the moodycamel blog:
And from the discussion in the comments:
This currently causes trouble (deadlocks), as we rely on the FIFO ordering to avoid starvation: hpx/libs/core/thread_pools/include/hpx/thread_pools/scheduling_loop.hpp Lines 325 to 332 in bd9e3cb
In the fail-case that I managed to reproduce, a perpetually yielding thread starves out a shared-resource-holding thread when these two happen to be in the same moodycamel thread-queue. We cannot reliably insert the yielding thread in the"back" of the queue (no FIFO guarantee), and the queue under certain circumstances acts similar to a LIFO, thus never allowing the shared-resource-holding thread to be picked up by the scheduler and release the held resource. |
Excellent detective work @Pansysk75! Is that true in case if we used the consumer/producer token API the queue provides as well? |
Thanks! Here's what the implementer has to say about this (TLDR it's not possible): https://www.github.com/cameron314/concurrentqueue/issues/6 |
Would this patch mitigate the issue? diff --git a/libs/core/thread_pools/include/hpx/thread_pools/scheduling_loop.hpp b/libs/core/thread_pools/include/hpx/thread_pools/scheduling_loop.hpp
index 1ca53941ee4..d2cc74fae4e 100644
--- a/libs/core/thread_pools/include/hpx/thread_pools/scheduling_loop.hpp
+++ b/libs/core/thread_pools/include/hpx/thread_pools/scheduling_loop.hpp
@@ -315,13 +315,6 @@ namespace hpx::threads::detail {
if (HPX_UNLIKELY(
state_val == thread_schedule_state::pending))
{
- if (HPX_LIKELY(next_thrd == nullptr))
- {
- // schedule other work
- scheduler.wait_or_add_new(num_thread, running,
- idle_loop_count, enable_stealing_staged, added);
- }
-
// schedule this thread again, make sure it ends up at
// the end of the queue
scheduler.SchedulingPolicy::schedule_thread_last(
@@ -329,6 +322,14 @@ namespace hpx::threads::detail {
threads::thread_schedule_hint(
static_cast<std::int16_t>(num_thread)),
true);
+
+ if (HPX_LIKELY(next_thrd == nullptr))
+ {
+ // schedule other work
+ scheduler.wait_or_add_new(num_thread, running,
+ idle_loop_count, enable_stealing_staged, added);
+ }
+
scheduler.SchedulingPolicy::do_some_work(num_thread);
}
else if (HPX_UNLIKELY(state_val == I think the problem is that currently after the pending thread has been put back onto the (top of) queue the scheduler immediately pulls it back and retries without potentially waiting threads having a change of being work-requested. For other schedulers those will get stolen eventually, thus avoiding the live-lock. |
@hkaiser Doesn't seem to help much... keep in mind that this issue doesn't show up on the workrequesting scheduler when using the default queue, so it's either the mc queue or the stealhalf functionality that is causing the issue (I also tried setting this to false
Also, I think this fix is needed (or sth equivalent): task_data thrds(d.num_thread_);
- thrds.tasks_.reserve(max_num_to_steal);
#ifdef HPX_HAVE_THREAD_STEALING_COUNTS
+ thrds.tasks_.reserve(max_num_to_steal);
thread_id_ref_type thrd;
while (max_num_to_steal-- != 0 &&
d.queue_->get_next_thread(thrd, false, true))
{
d.queue_->increment_num_stolen_from_pending();
thrds.tasks_.push_back(HPX_MOVE(thrd));
thrd = thread_id_ref_type{};
}
#else
+ thrds.tasks_.resize(max_num_to_steal);
d.queue_->get_next_threads(
thrds.tasks_.begin(), thrds.tasks_.size(), false, true); |
Uhh, how did it ever work? ;-) |
78626f0
to
1b888c6
Compare
That's fixed now. Thanks again! |
1b888c6
to
a0dca56
Compare
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferencesCodacy will stop sending the deprecated coverage status from June 5th, 2024. Learn more |
- flyby: fixing buffer allocation
a0dca56
to
2e610e7
Compare
Performance test reportHPX PerformanceComparison
Info
Comparison
Info
Comparison
Info
Explanation of Symbols
|
No description provided.